1. INTRODUCTION

To keep pace with a changing world, in terms of both technological advancement and lifestyle, almost all
organizations have embraced tools and techniques that derive useful knowledge from pools of transactional
data. This also applies to higher education, where conventional and new sources of such data have an
important role to play [1], [2]. These include a simple student grading profile that is normally
obtained from a university registration system, the history of course enrollment, and student logs from online
learning sessions [3], [4]. For a university and similar educational institutes, such data has proven critical to
remaining competitive and meeting the expectations of the young generation and the government. According to
the study of [5], the trend of applying data mining, recently renewed under the general concept of data science,
to various educational data and problems keeps increasing over the years. With this methodology of educational
data mining (EDM) [6], [7], planning and decision making can be improved by transforming a goldmine of data
specific to each university into working knowledge about student behavior, preferences of learning methods
and materials, communication channels, and other factors contributing to their achievement. Examples of
past development include the prediction of student performance, recommendation systems for courses or a
personalized learning plan, and the determination of atypical learning patterns and their causes [1], [8].

Drilling down to the topic of student performance or achievement, a number of previous studies
exploit newly customized and existing data mining models to demonstrate the benefits of
identifying at-risk students. Given this, a university may be able to act quickly or even prevent undesirable
events from taking place, hence reducing the damage to both student and university. The work of [9] focuses on
inventing a predictive model that accurately categorizes new students into different programmes of student
retention on campus. In addition, others [10]-[12] propose models that determine groups of students with
distinct preferences. Such a division leads to appropriate policy and treatment being implemented to ensure
student retention. Similarly, other investigations make use of a range of data mining
methods to model student performance and dropout. These include supervised learning models like the Naive
Bayes classifier [13] and decision tree [14], [15], with an unsupervised learning approach like k-means [16]
being an efficient alternative for a big set of data.

For Mae Fah Luang University (MFU) and other universities in Thailand, the problem of student
retention has gained a great deal of attention. This is because the country is moving closer to an aging society:
the ratio between the young and old population groups is getting smaller and smaller, hence fewer students will
pursue higher education. The present work is also motivated by initial attempts [17]-[21] that make use of basic
classification algorithms, and another set of studies [8], [22] that explores both existing methods and their
extensions. In [8], a new data transformation is introduced prior to the usual classification process.
For that, the concept of consensus clustering [23]-[25] is adopted to transform the original data into the
corresponding matrix with sample-cluster-relation embedding. Instead of modeling student performance
solely as a classification problem, it might be feasible to include an unsupervised model like data clustering
to determine the obvious cases, before forwarding the rest to a more complex classifier. This makes
the training procedure more efficient, as the classifier is trained with fewer samples. Besides, it might help
to alleviate the difficulty of class imbalance, which is rather common as the number of at-risk students is
often much smaller than that of the other group. As such, this paper introduces a bi-level learning framework
that first relates a new case to one of the pre-defined clusters. Then, for a cluster that sees almost all of
its members belonging to one class, a pattern of student graduation can be determined right away. On the other
hand, for a cluster with low purity, the prediction is produced by the cluster-specific classifier.

The proposed framework is exploited to determine graduation patterns, i.e., whether or not a student
finishes the enrolled programme within the regular period of 4 years. This knowledge provides an
opportunity for students, together with advisors, to adjust the plan of courses, which may help the student to
perform better or graduate on time. The model is designed in such a way that it is applicable to different
programmes across schools at MFU. To be precise, courses are grouped into categories that are common to all
students, thus generalizing the target learning model. For the current research, the framework is evaluated
with a real data collection, which covers students graduating in 2016. The rest of this paper is organized as
follows. Section 2 presents the research methodology of this study, including details of the data mining
process, the investigated data collection, and the proposed framework of bi-level learning. After that, the
experimental design, the corresponding results and discussion are provided in section 3. The paper is then
concluded in section 4 with a perspective on future research.


2. METHOD 

This research follows the common process of data mining or data science studies, especially those focusing on
EDM [8], [9], [20]. In particular, the target data is first identified, followed by the preparation stage that
ensures the readiness and quality of the final data set. Having completed this, the bi-level learning framework
is described with respect to the characteristics of the data under investigation. These issues are discussed in
the following sections.


2.1. Data acquisition and preparation 

In order to obtain an effective framework, it is designed based on transactional data maintained in
the MFU registration system. Due to the concern of data privacy, the current project initially exploits
only the academic records of undergraduate students who graduated in 2016 (or 2559 in B.E.). This
population consists of 1,162 cases from 2 schools: management and information technology. The retrieval
of these is subject to the condition that a selected sample has completed the number of required courses for
three subject categories: general education courses, specific required courses, and free elective
courses. Moreover, records belonging to students with a programme transfer or exchange are excluded.

Within the registration database, the two important tables from which the target data is retrieved are
shown in Figure 1: ‘Student personal information’ and ‘Student enrollment information’. In the former, each
student is represented with a personal identification number (ID), the year of entry specified in B.E., the
name of the school that administers the enrolled programme, and the graduation GPAX. The latter describes the
enrolled courses, course categories and the grades achieved. Given these, the target data can be obtained by
joining the two tables on student ID. Following that, the ‘Student data for analysis’ table in
Figure 1 can be generated by collapsing the multiple rows of a single student (each representing one course)
into one record. For this purpose, course names are ignored, whilst frequencies of the different grades (i.e.,
A, B+, B, C+, C, D+, D, F, P, S, U, V, and W) are accumulated. Note that three sets of grade frequencies are
formed, one for each course category. Table 1 presents details of these sets of grade frequencies, which are
considered attributes or features of the intermediate data.


[Figure 1 panels: ‘Student personal information’, ‘Student enrollment information’, and ‘Student data for analysis’ tables]


Figure 1. Initial target data (‘Student personal information’ and ‘Student enrollment information’ tables) and 
the initial data preparation procedure that produces the final data set (i.e., ‘Student data for analysis’ table)
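
The join-and-collapse procedure of Figure 1 can be illustrated with a minimal Python sketch. The field names (`id`, `year`, `category`, `grade`) and the category codes are hypothetical, since the actual schema of the MFU registration system is not given here; only the grade list and the attribute naming of Table 1 (e.g., A1, BB1) are taken from the text.

```python
from collections import Counter, defaultdict

GRADES = ["A", "B+", "B", "C+", "C", "D+", "D", "F", "P", "S", "U", "V", "W"]
# Hypothetical category codes; the suffixes 1, 2, 3 follow Table 1.
CATEGORIES = ["specific", "elective", "general"]


def attribute_name(grade, category_index):
    """Name a grade frequency as in Table 1, e.g. 'B+' in category 1 -> 'BB1'."""
    prefix = grade[0] * 2 if grade.endswith("+") else grade
    return f"{prefix}{category_index}"


def collapse_enrollments(students, enrollments):
    """Join both tables on student ID and count grades per course category."""
    counts = defaultdict(Counter)
    for row in enrollments:  # one row per enrolled course; course names are ignored
        counts[row["id"]][(row["category"], row["grade"])] += 1

    records = []
    for student in students:
        record = {"ID": student["id"], "YEAR": student["year"]}
        for index, category in enumerate(CATEGORIES, start=1):
            for grade in GRADES:
                record[attribute_name(grade, index)] = counts[student["id"]][(category, grade)]
        records.append(record)
    return records


students = [{"id": "s1", "year": 2556}]
enrollments = [
    {"id": "s1", "category": "specific", "grade": "A"},
    {"id": "s1", "category": "specific", "grade": "A"},
    {"id": "s1", "category": "elective", "grade": "B+"},
]
print(collapse_enrollments(students, enrollments)[0]["A1"])
```

Each student ends up with one record of 41 attributes (ID, YEAR, and 3 x 13 grade frequencies), which corresponds to the layout described in Table 1.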


Having obtained this intermediate data set, the following pre-processing steps are needed to create
the final data set, which will be analyzed using the proposed framework.
(i) Each grade frequency such as A1, A2 and A3 in Table 1 is normalized such that its value domain is
transformed to be within the range of [0, 1]. This ensures the absence of biases among different attributes
in the analysis (i.e., these data attributes are equally important). Furthermore, it helps to overcome
the problem that different programmes may consist of different numbers of courses in those three categories.
As a result, the normalization of each grade frequency $f_{xi}$ in the category $x$ is defined as $f^{*}_{xi}$,
which can be estimated by the following.


$$f^{*}_{xi} = \frac{f_{xi}}{\sum_{\forall j \in x} f_{xj}} \qquad (1)$$


(ii) Then, the attribute ID is removed in order to protect the privacy of personal information.

(iii) Finally, the attribute YEAR, which represents the entry year in B.E., is transformed to the number of
years each student has spent in the programme before graduation. Note that students who graduate in year $y$
actually started the programme in year $y - 3$ or before. Given this knowledge, the new value $y^{*}_{k}$ of the
YEAR attribute for a student $k$ (where $k = 1, \ldots, N$; $N$ = 1,162) can be calculated from the entry year
$y_{k}$ as follows, where $y_{grad}$ is the year of graduation and is set to 2559 in B.E. (or 2016 as mentioned
earlier) for the present study. Note that other data batches may be available for future studies, where
$y_{grad}$ can be 2559, 2560, 2561 and so on.


$$y^{*}_{k} = y_{grad} - y_{k} + 1 \qquad (2)$$

After these steps of data preparation, the final data set is composed of N=1,162 samples and D=40
features or attributes. Of these samples, 911 belong to the School of management, and the other 251
represent students from the School of information technology. The features can be summarized as follows.

— 13 normalized grade frequencies in the category of specific required courses; $d_{1}, \ldots, d_{13}$ in [0, 1].

— 13 normalized grade frequencies in the category of free elective courses; $d_{14}, \ldots, d_{26}$ in [0, 1].

— 13 normalized grade frequencies in the category of general education courses; $d_{27}, \ldots, d_{39}$ in [0, 1].

— YEAR, which is now the number of years before graduation; $d_{40}$ in {4, 5, 6, 7}. It is noteworthy that the
minimum number of years anyone at MFU has to be in a programme is 4. Also, it is possible for a
student to spend up to 7 years in a specific programme before graduation.
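
The preparation steps that lead to these 40 features can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the record layout (`id`, `year`, `freqs`) is hypothetical, and returning zeros for a category without any grade is an added safeguard that the text does not discuss.

```python
def normalize_category(freqs):
    """Equation (1): divide each grade frequency by the category total."""
    total = sum(freqs)
    if total == 0:
        # Assumption: an empty category maps to all zeros (not covered in the text).
        return [0.0 for _ in freqs]
    return [f / total for f in freqs]


def years_in_programme(entry_year, grad_year=2559):
    """Equation (2): years spent in the programme, both years given in B.E."""
    return grad_year - entry_year + 1


def preprocess(record, grad_year=2559):
    """Build the 40-feature vector; the student ID is deliberately left out."""
    features = []
    for category in ("specific", "elective", "general"):
        features.extend(normalize_category(record["freqs"][category]))
    features.append(years_in_programme(record["year"], grad_year))
    return features


record = {
    "id": "s1",  # dropped by preprocess()
    "year": 2556,
    "freqs": {
        "specific": [2, 1, 1] + [0] * 10,
        "elective": [0, 4] + [0] * 11,
        "general": [1] * 13,
    },
}
vector = preprocess(record)
print(len(vector), vector[0], vector[-1])
```

With an entry year of 2556 and graduation in 2559, the last feature is 4, which matches the regular study period mentioned in the text.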


Table 1. Description of student information and different grading frequencies (i.e., the intermediate data)


No Attribute Name Description 
1 ID Student identification number 
2 YEAR Year of entry (in B.E.) 
3 A1 Number of grade A obtained from specific required courses 
4 BB1 Number of grade B+ obtained from specific required courses 
5 B1 Number of grade B obtained from specific required courses 
6 CC1 Number of grade C+ obtained from specific required courses 
7 C1 Number of grade C obtained from specific required courses 
8 DD1 Number of grade D+ obtained from specific required courses 
9 D1 Number of grade D obtained from specific required courses 
10 F1 Number of grade F obtained from specific required courses 
11 P1 Number of grade P obtained from specific required courses 
12 S1 Number of grade S obtained from specific required courses 
13 U1 Number of grade U obtained from specific required courses 
14 V1 Number of grade V obtained from specific required courses 
15 W1 Number of grade W obtained from specific required courses 
16 A2 Number of grade A obtained from free elective courses 
17 BB2 Number of grade B+ obtained from free elective courses 
18 B2 Number of grade B obtained from free elective courses 
19 CC2 Number of grade C+ obtained from free elective courses 
20 C2 Number of grade C obtained from free elective courses 
21 DD2 Number of grade D+ obtained from free elective courses 
22 D2 Number of grade D obtained from free elective courses 
23 F2 Number of grade F obtained from free elective courses 
24 P2 Number of grade P obtained from free elective courses 
25 S2 Number of grade S obtained from free elective courses 
26 U2 Number of grade U obtained from free elective courses 
27 V2 Number of grade V obtained from free elective courses 
28 W2 Number of grade W obtained from free elective courses 
29 A3 Number of grade A obtained from general education courses 
30 BB3 Number of grade B+ obtained from general education courses 
31 B3 Number of grade B obtained from general education courses 
32 CC3 Number of grade C+ obtained from general education courses 
33 C3 Number of grade C obtained from general education courses 
34 DD3 Number of grade D+ obtained from general education courses 
35 D3 Number of grade D obtained from general education courses 
36 F3 Number of grade F obtained from general education courses 
37 P3 Number of grade P obtained from general education courses 
38 S3 Number of grade S obtained from general education courses 
39 U3 Number of grade U obtained from general education courses 
40 V3 Number of grade V obtained from general education courses 
41 W3 Number of grade W obtained from general education courses 


2.2. Model development 

This section presents the process of model development, including cluster analysis that is conducted 
initially to observe the grouping structure within the final data set, and details of the proposed bi-level model 
with its evaluation being reported in section 3. 


2.2.1. Initial cluster analysis 

At first, it is essential to examine whether the structure of the data is appropriate for developing the
desired bi-level learning framework. In other words, after applying a clustering algorithm to the data set,
there should be a cluster that is pure or almost pure (i.e., almost all samples in the cluster belong to the
same class). Besides, there are also other clusters in the same clustering result that are not pure, and need
additional classifiers to determine an appropriate class for their members. The final data set $X$ is further
divided into two subsets of school-specific samples: $X_{1}$ for the School of management and $X_{2}$ for the
School of information technology. For each of these data sets, the k-means clustering algorithm is applied $M$
times for a particular number of clusters $k$. These multiple trials are required to draw a reliable conclusion
from a non-deterministic model like k-means. For each run $p = 1, \ldots, M$, the result $C^{k}_{p}$ is assessed
with two well-known validity indices, DB and Dunn (see [5] and [6] for more details). For each $C^{k}_{p}$, there
are two measurements, $DB^{k}_{p}$ and $Dunn^{k}_{p}$. Then, the averages across the $M$ runs can be estimated and
presented as $DB^{k}_{*}$ and $Dunn^{k}_{*}$, respectively.


$$DB^{k}_{*} = \frac{\sum_{p=1 \ldots M} DB^{k}_{p}}{M} \qquad (3)$$

$$Dunn^{k}_{*} = \frac{\sum_{p=1 \ldots M} Dunn^{k}_{p}}{M} \qquad (4)$$
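
As an illustration of the two validity indices and their averaging over runs, a minimal Python sketch is given below. The text refers to DB and Dunn only by name, so the commonly used definitions (centroid-based scatter for DB, closest pair over largest diameter for Dunn) are assumed here, together with Euclidean distance; the function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of the cluster members."""
    return [sum(column) / len(points) for column in zip(*points)]


def davies_bouldin(clusters):
    """DB index (assumed standard definition); lower values indicate better clustering."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(dist(p, z) for p in c) / len(c) for c, z in zip(clusters, cents)]
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


def dunn(clusters):
    """Dunn index (assumed standard definition); higher values indicate better clustering.

    Requires at least one cluster with two distinct points, otherwise the
    largest diameter is zero.
    """
    k = len(clusters)
    min_between = min(
        dist(p, q)
        for i in range(k) for j in range(i + 1, k)
        for p in clusters[i] for q in clusters[j]
    )
    max_diameter = max(dist(p, q) for c in clusters for p in c for q in c)
    return min_between / max_diameter


def average_over_runs(values):
    """Equations (3) and (4): mean of an index over the M runs for one k."""
    return sum(values) / len(values)


toy = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]  # hypothetical clustering result
print(davies_bouldin(toy), dunn(toy))
```

In practice each of the M k-means runs would produce its own list of clusters, and `average_over_runs` would be applied to the M index values collected for the same k.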


The aforementioned procedure is repeated for a range of different $k$ values, i.e., $k \in \{2, 3, \ldots, k_{max}\}$.
The optimal $k$ is selected from this range as the value that provides the best values of $DB^{k}_{*}$ and
$Dunn^{k}_{*}$. To accomplish this, a rank-based approach is exploited such that the parameter $k$ with the minimum
overall ranking score ($RS^{k}$) is preferred. As a low DB measure indicates a good clustering, the $DB^{k}_{*}$
values for different $k$ are ranked from minimum to maximum. Given this ranked list, the k-specific ranking score
$RS^{k}_{DB}$ can be determined, where the first in the list is assigned 1 and the last $k_{max} - 1$. In case of
a tie, the average ranking score is given to the tied values. Likewise, the k-specific ranking score
$RS^{k}_{Dunn}$ can be estimated from the ranked list in which high $Dunn^{k}_{*}$ measures appear at the front, as
they represent better clustering than those with lower Dunn values. Provided these, the overall ranking score
specific to $k$ can be calculated as follows. After that, the optimal $k$ value is identified as the one with the
minimum $RS^{k}$, $k \in \{2, \ldots, k_{max}\}$.


$$RS^{k} = RS^{k}_{DB} + RS^{k}_{Dunn} \qquad (5)$$
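
The rank-based selection of k, including the averaging of tied ranks, can be sketched in Python as shown below. The index values in the example are hypothetical and serve only to exercise the tie handling; picking the smallest k when two overall scores are equal is an assumption, as the text does not cover that case.

```python
def average_ranks(values, reverse=False):
    """Rank values starting from 1 (best); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def select_k(ks, db_avg, dunn_avg):
    """Equation (5): RS^k = RS^k_DB + RS^k_Dunn; return the k with the minimum score."""
    rs_db = average_ranks(db_avg)                    # low DB is better
    rs_dunn = average_ranks(dunn_avg, reverse=True)  # high Dunn is better
    scores = [a + b for a, b in zip(rs_db, rs_dunn)]
    # Assumption: on equal overall scores, the first (smallest) k is kept.
    best = min(range(len(ks)), key=lambda i: scores[i])
    return ks[best], scores


# Hypothetical averaged index values for k = 2, 3, 4.
best_k, scores = select_k([2, 3, 4], [0.5, 0.7, 0.7], [0.9, 0.4, 0.6])
print(best_k, scores)
```

In this toy input the two tied DB values share the rank 2.5, so the overall scores are 2.0, 5.5 and 4.5, and k=2 is selected.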


With kmax being 10, clustering results with two clusters (or k=2) prove to be better than those using
other k values. Figures 2 and 3, for the School of management and the School of information technology
respectively, illustrate the two clusters obtained from the trial with the best quality measures. According to
Figure 2, Cluster 1 is almost pure, with 444 out of 447 samples (i.e., 99%) having the entry year of 2556
(in B.E.), or YEAR=4, while only 1% spend 5 years before graduation. Cluster 0, however, is less pure, with the
majority of 85% finishing on time, or YEAR=4. The other 15% is a mixture of samples with YEAR values of 5
(13%), 6 (1%), and 7 (1%). Similar observations of the two clusters are also obtained with samples of the School
of information technology; see Figure 3 for more details. Hence, a clustering process may well be used
to provide an accurate prediction model for specific clusters, such as Cluster 1 in both cases.
Nonetheless, a classifier is also required in addition to the initial clustering for some other clusters, for
instance Cluster 0 in Figures 2 and 3. This finding leads to the proposed framework that will be explained
next.


[Figure 2 panels: School of Management, Cluster 0 and Cluster 1]


Figure 2. The best clustering result with k=2, for samples belonging to school of management


[Figure 3 panels: School of Information Technology, Cluster 0 and Cluster 1]





Figure 3. The best clustering result with k=2, for samples belonging to school of information technology 


2.2.2. Proposed model 

This section provides details of the proposed framework of bi-level learning, in which unsupervised and
supervised learning approaches are systematically combined to produce accurate, yet efficient, learning and
prediction processes. The steps taken to generate or train a model are given as:
Step 1: For a given specific case $q$ (e.g., a school), suppose that $X_{q,train}$ and $X_{q,test}$ are the
training and test data, respectively. The process of model generation makes use of only the former, while the
latter is used to assess the quality of the resulting model. With a clustering algorithm, the procedure
explained in section 2.2.1 is conducted on $X_{q,train}$ to find the optimal number of clusters $k$. Then, one of
the $M$ alternative clustering results with that best $k$ is selected as $C^{k}$ to represent the knowledge model
in the first level. Note that for this stage, the YEAR feature is left out such that groups of students are
formed based solely on grade achievement. The prediction problem is designed as a binary classification, with
two classes of A (YEAR=4) and B (YEAR > 4).
Step 2: For each cluster $c^{k}_{t}$ in the clustering $C^{k}$ from Step 1 (where $t = 1, \ldots, k$), its centroid
$z^{k}_{t}$ is used as a reference for a new sample in the test or prediction phase. Please refer to [20] for
details of estimating a centroid from cluster members.
Step 3: Again, for each cluster, find the percentage of the majority class among samples in that cluster. The
analysis stops at this clustering level if that percentage is greater than or equal to $\alpha$ (i.e., a
predefined value of the minimum percentage for a pure cluster). In that case, the cluster represents that
majority class, which becomes the prediction for any new instance that is closest to the corresponding cluster
centroid. Otherwise, a classifier is to be built using samples of this specific cluster (see Step 4).
Step 4: When a cluster is not pure up to the expected level of $\alpha$, samples in that cluster are used to train
a classifier using the chosen classification algorithm. Note that a conventional feature-based classification
model like Naive Bayes can be used here. Please see section 3.1 for all methods that are employed in the
present investigation.
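
Steps 2 to 4 can be sketched in Python as follows, assuming that the cluster assignments from Step 1 are already available and that the feature vectors exclude YEAR. The model layout (a list of dictionaries) and the function names are hypothetical, and a 1-nearest-neighbour rule stands in for the second-level classifier purely to keep the example self-contained.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of the cluster members (Step 2)."""
    return [sum(column) / len(points) for column in zip(*points)]


def train_one_nn(samples, labels):
    """Stand-in second-level classifier: KNN with K=1 on the cluster members."""
    def predict(x):
        nearest = min(range(len(samples)), key=lambda i: dist(x, samples[i]))
        return labels[nearest]
    return predict


def train_bilevel(X, y, assignments, alpha=0.90, make_classifier=train_one_nn):
    """Build one model entry per cluster: a fixed label or a cluster-specific classifier."""
    model = []
    for t in sorted(set(assignments)):
        idx = [i for i, a in enumerate(assignments) if a == t]
        members = [X[i] for i in idx]
        member_labels = [y[i] for i in idx]
        majority, count = Counter(member_labels).most_common(1)[0]
        purity = count / len(idx)  # Step 3: share of the majority class
        entry = {"centroid": centroid(members), "purity": purity,
                 "label": None, "classifier": None}
        if purity >= alpha:
            entry["label"] = majority  # pure cluster: prediction stops at Level 1
        else:
            entry["classifier"] = make_classifier(members, member_labels)  # Step 4
        model.append(entry)
    return model


# Hypothetical toy data: two grade-based features per student, classes A and B.
X = [(0, 0), (0, 1), (5, 5), (5, 6), (6, 5)]
y = ["A", "A", "A", "B", "B"]
model = train_bilevel(X, y, assignments=[0, 0, 1, 1, 1])
print([round(entry["purity"], 2) for entry in model])
```

Any of the classifiers listed in section 3.1 could be passed through `make_classifier`, as long as it returns a callable that maps a feature vector to a class.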

After going through the steps explained above, the resulting bi-level model can be exploited to
predict the class of a new instance in $X_{q,test}$ as follows.
Level 1: For a sample $g$ in the test data $X_{q,test}$, find the optimal centroid $z^{k}_{g}$ amongst the $k$
alternatives that provides the minimum distance to the sample $g$. This is defined by the following equation.
Note that $d(\cdot)$ is a distance function, with Euclidean distance being used in the current research.


$$z^{k}_{g} = \underset{z^{k}_{t},\ t = 1 \ldots k}{\arg\min}\ d(g, z^{k}_{t}) \qquad (6)$$


If the optimal centroid $z^{k}_{g}$ represents a cluster with the final class prediction (i.e., without an
additional classifier), the predicted class is simply provided. Otherwise, classify the sample $g$ using the
cluster-specific classifier in Level 2.

Level 2: Given the sample $g$, produce a class prediction using the classifier specifically developed for the
cluster whose centroid $z^{k}_{g}$ was identified in Level 1.
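
The two prediction levels can be illustrated with a short Python sketch. The model layout (a list of dictionaries holding a centroid and either a fixed label or a classifier) is a hypothetical representation chosen for this example, and the hand-written threshold rule merely stands in for a trained cluster-specific classifier.

```python
import math


def dist(a, b):
    """Euclidean distance, as used for d(.) in equation (6)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_bilevel(model, g):
    """Predict the class of sample g with a trained bi-level model."""
    # Level 1: locate the cluster whose centroid is closest to g.
    nearest = min(model, key=lambda entry: dist(g, entry["centroid"]))
    if nearest["classifier"] is None:
        return nearest["label"]  # pure cluster: its majority class is the answer
    # Level 2: defer to the classifier trained on that cluster only.
    return nearest["classifier"](g)


# Hypothetical hand-built model with one pure and one impure cluster.
toy_model = [
    {"centroid": (0.0, 0.0), "label": "A", "classifier": None},
    {"centroid": (5.0, 5.0), "label": None,
     "classifier": lambda g: "B" if g[0] > 5 else "A"},
]
print(predict_bilevel(toy_model, (0.2, 0.1)), predict_bilevel(toy_model, (6.0, 5.0)))
```

Only samples that fall closest to an impure cluster reach the second level, which is what keeps the cluster-specific classifiers small.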


3. RESULTS AND DISCUSSION 

In this section, the design of the empirical study is explained, which includes the investigated data and
evaluation approach, settings of algorithm parameters, and compared methods. Furthermore, results and
important findings are discussed in such a way as to provide useful information and guidelines.


3.1. Experimental design

This experiment makes use of the final data set of 1,162 samples, which is described in section 2.1.
Two cases are formed according to the two schools to which these samples belong: 1) School of management with
911 samples, and 2) School of information technology with the other 251. Other settings are listed as:

a. k-means is used as the clustering algorithm in the bi-level learning framework, with M=10 for the
number of trials to be investigated for a particular number of clusters k. Also, note that k is selected from
a range of 2 to kmax, where kmax=10.

b. The minimum level of cluster purity is determined by the proportion of the majority class, which is
specified by the variable α=90%.

c. Four algorithms are examined as the choice to create the classifier in Level 2 of the proposed model.
These include: Naive Bayes (using Gaussian distribution for numerical features), K-nearest neighbors or
KNN (using K ∈ {1, 3} so as to generalize the findings), Decision Tree (with the maximum depth=10), and
Random Forest (with the size of forest=20).

d. 10-fold cross validation is exploited as the evaluation approach here, such that each sample is a member
of the test data once. As such, a confusion matrix is produced for this binary classification problem.

e. In addition, two compared methods are considered as baseline counterparts of the bi-level
learning framework:

— Clustering-only prediction, i.e., only Level 1 in the proposed model is implemented.

— Classification-only prediction, where cluster analysis is not included and a classifier is generated from
the entire training data set. The same collection of four classification algorithms specified above is also
examined in this specific use case.
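
The evaluation protocol in item d can be sketched in Python as below. Shuffling with a fixed seed and the absence of stratification are assumptions, since the text only states that each sample is a member of the test data once; the function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: random, non-stratified folds
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test


def confusion_matrix(actual, predicted, classes=("A", "B")):
    """Count samples as matrix[actual class][predicted class]."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


splits = list(k_fold_indices(911))  # 911 samples: School of management
print(len(splits), len(splits[0][1]))
```

Predictions collected over the ten test folds can then be passed to `confusion_matrix` together with the true classes to obtain one matrix per model.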


3.2. Experimental results and discussion 

Based on the design described in the previous section, Table 2 shows the evaluation results of 6
different models for the case of the School of management. Both overall and class-specific accuracies,
expressed as percentages in [0, 100], are exploited here to compare the predictive performance of different
methods. For instance, the accuracy of Class A is estimated as the number of Class A samples that are predicted
correctly divided by the total number of Class A samples. In this table, all variants of the bi-level model have
higher overall accuracies than that of the clustering-only counterpart. In addition, Random Forest (RF) obtains
the highest overall accuracy of 93.96%. With respect to the accuracy of Class A, all the models are able to
generate exceptional performance, with RF being the best again. However, for Class B, Naive Bayes (NB) achieves
the highest accuracy of 79.71%, while RF obtains only 42.03%. Unfortunately, the clustering-only or Level 1
model is not able to identify any sample of Class B, resulting in an accuracy of 0%. Another observation
is that the KNN model performs better with K=1 than with a bigger neighbor set of K=3.
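
The accuracy definitions above can be checked with a few lines of Python, using the counts reported for the bi-level Random Forest model in Table 2. Reading each pair of counts as the samples of one actual class is an assumption made for this sketch; it is the reading under which the reported percentages for this model are reproduced.

```python
def accuracies(matrix):
    """Return (class-specific accuracies, overall accuracy) in percent.

    matrix[actual][predicted] holds sample counts.
    """
    per_class = {}
    correct = 0
    total = 0
    for actual, row in matrix.items():
        row_total = sum(row.values())
        per_class[actual] = 100.0 * row[actual] / row_total if row_total else 0.0
        correct += row[actual]
        total += row_total
    return per_class, 100.0 * correct / total


# Counts of the bi-level (Random Forest) rows in Table 2, read as actual classes.
rf = {"A": {"A": 827, "B": 15}, "B": {"A": 40, "B": 29}}
per_class, overall = accuracies(rf)
print(round(per_class["A"], 2), round(per_class["B"], 2), round(overall, 2))
```

Under this reading the sketch prints 98.22, 42.03 and 93.96, matching the values quoted for RF in the text.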


Table 2. Evaluation results with different models, for the case of School of management 


Model                       Confusion matrix            Class-specific   Overall
                            A      B    Classified as   accuracy         accuracy
Level 1 (Clustering only)   839    72   A               92.10%
                            0      0    B               0.00%
Bi-level (Naive Bayes)      792    50   A               94.06%
                            14     55   B               79.71%
Bi-level (KNN, K=1)         815    27   A               96.79%
                            38     31   B               44.93%
Bi-level (KNN, K=3)         810    32   A               96.19%
                            41     28   B               40.58%
Bi-level (Decision Tree)    819    23   A               97.21%           93.19%
                            39     30   B               43.48%
Bi-level (Random Forest)    827    15   A               98.22%           93.96%
                            40     29   B               42.03%


In addition to the results reported in Table 2, Figure 4 depicts the comparison of accuracies specific
to Class A, which are achieved by different variations of the bi-level framework (shown in Table 2) and four
simple classifiers (NB, KNN, DT, and RF trained with the whole training set). Note that for KNN, only results
with K=1 are reported since they demonstrate the best performance among different K values.
According to this, all four bi-level variations perform better than their corresponding baselines. For
instance, the bi-level model implementing RF acquires an accuracy of 98.22%, almost 2% higher than the
score achieved by a simple RF classifier. The largest improvement is witnessed in the case of NB, where the
bi-level version reaches 94.06% and a simple NB only 88.10%. Likewise, Figure 5 shows a similar set
of results for the Class-B prediction. This figure suggests that the bi-level framework usually outperforms the
corresponding simple classification models. In particular, NB obtains the highest accuracy of 79.71%, while
the lowest of 42.03% is seen with RF. However, this is still a significant improvement over a simple
RF, which is accurate at only 27.55%.





Figure 4. Class-A accuracies obtained by bi-level and basic classifiers, categorized by classification
algorithm exploited for the training process. Note that the results with KNN are obtained using K=1

Figure 5. Class-B accuracies obtained by bi-level and basic classifiers, categorized by classification
algorithm exploited for the training process. Note that the results with KNN are obtained using K=1


Similar to Table 2, Table 3 shows details of the evaluation results with the data belonging to the School
of information technology. For the overall accuracy, the bi-level (NB) and the clustering-only model obtain
the highest and the lowest scores, respectively. The bi-level (RF) is the most effective for Class-A
classification at 97.17%, while the bi-level (NB) proves to be exceptional for Class B, reaching a high value
of 92.31%. These results lead to the conclusion that the proposed framework is more accurate than using only
the clustering results to guide prediction. Again, the KNN model with K=1 performs better than the one
using K=3. Besides these, Figures 6 and 7 compare the accuracies obtained by bi-level variations and basic
classifiers for Class A and Class B, respectively. As in the previous case, the trends found with the School of
management also appear here with students from the School of information technology. So, the findings that the
proposed framework is better than simple classifiers and a clustering-only prediction are confirmed by these
two sets of results. In fact, the framework is general and applicable across different schools.


Table 3. Evaluation results with different models, for the case of School of information technology 


Model                       Confusion matrix            Class-specific   Overall
                            A      B    Classified as   accuracy         accuracy
Level 1 (Clustering only)                                                83.27%
Bi-level (Naive Bayes)                                                   93.63%
Bi-level (KNN, K=1)                                                      86.06%
Bi-level (KNN, K=3)                                                      84.06%
Bi-level (Decision Tree)                                                 89.24%
Bi-level (Random Forest)                                                 92.43%


In order to digest those results further, Figure 8 reveals an important finding regarding the problem
of class imbalance. According to Tables 2 and 3, the accuracies reported for Class A are usually better than
those of Class B. This is largely due to the uneven cardinality of samples belonging to these two classes. In
fact, based on the original class distribution for the School of management shown in Figure 8, the proportion
of instances of Class A is 92.10%, with only 7.90% for the other. It is slightly better for the School of
information technology, with the ratios being 83.27% and 16.73%. It can be summarized from Figures 6 and 7 that
most models included in this empirical study exhibit better performance with Class B in the case of the School
of information technology, compared to the other case. The level of imbalance between classes in the former is
less than in the latter, which may well explain that observation. Another point worth noting here is that the
bi-level framework can ease the imbalance problem, as higher proportions of Class-B samples are included in
the stage of classification modeling; see Figure 8 for details. Hence, bi-level variants are more accurate than
their corresponding baseline counterparts, i.e., simple classifiers.
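
The easing of class imbalance at the second level can be illustrated with a small worked example in Python for the School of management. The counts are derived rather than quoted: it is assumed that the Level 1 row of Table 2 gives the class totals (839 and 72) and that Cluster 1 (444 of 447 samples in Class A) is the pure cluster resolved at the first level, so the remainder is what reaches the cluster-specific classifier.

```python
def class_share(n_a, n_b):
    """Percentage of Class A and Class B samples, rounded to two decimals."""
    total = n_a + n_b
    return round(100.0 * n_a / total, 2), round(100.0 * n_b / total, 2)


# Assumed class totals for School of management (Level 1 row of Table 2).
total_a, total_b = 839, 72
# Cluster 1 as reported in section 2.2.1: 444 of its 447 samples are Class A.
pure_size, pure_a = 447, 444

# Derived counts of the samples left for the second-level classifier.
second_a = total_a - pure_a
second_b = total_b - (pure_size - pure_a)

original = class_share(total_a, total_b)
second_level = class_share(second_a, second_b)
print(original, second_level)
```

Under these assumptions the Class-B share rises from 7.90% in the original data to 14.87% among the samples used for classification modeling, which is consistent with the 85% majority reported for Cluster 0.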





Figure 6. Class-A accuracies obtained by bi-level and basic classifiers, categorized by classification
algorithm exploited for the training process. Note that the results with KNN are obtained using K=1

Figure 7. Class-B accuracies obtained by bi-level and basic classifiers, categorized by classification
algorithm exploited for the training process. Note that the results with KNN are obtained using K=1

[Figure 8 panels: School of Management and School of IT; proportions of Class A and Class B for the bi-level and original data]

Figure 8. Comparison of class distributions between the entire original data (without clustering process) and 
those samples belonging to the cluster going through the second level of bi-level framework 


4. CONCLUSION 

This paper has presented an original work on the application of a bi-level learning framework to
determine patterns of student graduation. It is designed around a real collection of student enrollment and
personal information. The proposed framework is divided into two tiers, with the first applying a clustering
technique to obtain clusters of student samples. A cluster of high purity is used as a reference for prediction,
whereas those with purity below a user-defined threshold are further analyzed using a choice of classifier.
Evaluated on a data set specific to Mae Fah Luang University, the bi-level variations usually perform better
than applying simple classifiers to the whole data, or relying on the clustering result alone. This is due to
the ability to ease the class imbalance to a certain extent. In fact, the application of Naive Bayes (NB) and
Random Forest (RF) in the bi-level learning framework has proven more effective than other alternatives in
this empirical study. While the former is the most accurate for Class B, the latter is exceptional for Class A.

Despite such a positive finding, there are a few issues that might lead to future work. In addition to
the methodology of the bi-level learning model, an oversampling or undersampling technique may well be
exploited to resolve the problem of class imbalance further. Also, the concept of classifier ensemble may be
useful to aggregate predictions made by different classifiers, which are deployed at the second level of
the proposed framework. Another direction is the use of consensus clustering and recent variants to provide
an accurate clustering in the initial layer of the proposed model.